Combining Bilingual and Comparable Corpora for Low Resource Machine Translation

Authors

  • Ann Irvine
  • Chris Callison-Burch
Abstract

Statistical machine translation (SMT) performance suffers when models are trained on only small amounts of parallel data. The learned models typically have both low accuracy (incorrect translations and feature scores) and low coverage (high out-of-vocabulary rates). In this work, we use an additional data resource, comparable corpora, to improve both. Beginning with a small bitext and corresponding phrase-based SMT model, we improve coverage by using bilingual lexicon induction techniques to learn new translations from comparable corpora. Then, we supplement the model’s feature space with translation scores estimated over comparable corpora in order to improve accuracy. We observe improvements between 0.5 and 1.7 BLEU translating Tamil, Telugu, Bengali, Malayalam, Hindi, and Urdu into English.
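The coverage step described above can be illustrated with a minimal sketch of bilingual lexicon induction. The abstract does not say which signals the authors use, so this example assumes one common choice, context similarity: the context vector of an out-of-vocabulary source word is projected through a small seed dictionary and compared against target-language context vectors. The function names, the romanized toy sentences, and the seed dictionary are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = vectors.setdefault(word, Counter())
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def project(vector, seed_dict):
    """Rewrite a source-language context vector in target-language words,
    keeping only context words covered by the seed dictionary."""
    projected = Counter()
    for src_word, count in vector.items():
        if src_word in seed_dict:
            projected[seed_dict[src_word]] += count
    return projected


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def induce_translations(oov_word, src_vectors, tgt_vectors, seed_dict, top_k=3):
    """Rank target words by similarity to the projected context of `oov_word`."""
    projected = project(src_vectors.get(oov_word, Counter()), seed_dict)
    scored = [(cosine(projected, vec), tgt) for tgt, vec in tgt_vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(tgt, score) for score, tgt in scored[:top_k]]


# Toy comparable corpora (hypothetical data, not from the paper).
src_sentences = [["billi", "doodh", "peeti"], ["kutta", "haddi", "khata"]]
tgt_sentences = [["cat", "drinks", "milk"], ["dog", "eats", "bone"]]

# Small seed dictionary, standing in for entries learned from the small bitext.
seed_dict = {"doodh": "milk", "peeti": "drinks", "haddi": "bone", "khata": "eats"}

src_vectors = context_vectors(src_sentences)
tgt_vectors = context_vectors(tgt_sentences)

# "billi" is not in the seed dictionary, so it plays the role of an OOV word.
candidates = induce_translations("billi", src_vectors, tgt_vectors, seed_dict)
print(candidates)
```

On this toy data the top-ranked candidate for "billi" is "cat". In a setup like the one the abstract describes, similarity scores of this kind could also serve as additional phrase-table features, which is the role the abstract assigns to translation scores estimated over comparable corpora.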


Similar resources

Neural Machine Translation for Low Resource Languages using Bilingual Lexicon Induced by Comparable Corpora

Automatically extracting parallel sentence pairs from the multilingual articles available on the Internet can address the data sparsity problem in building multilingual natural language processing applications, especially in machine translation. In this project, we have used an end-to-end siamese bidirectional recurrent neural network to generate parallel sentences from comparable multilingual ...


Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems

The data used to train statistical machine translation systems are usually drawn from three kinds of resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are the ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used as sources for parallel text extraction. Most existing methods for exploiting comparable corpora loo...


An Efficient Framework for Extracting Parallel Sentences from Non-Parallel Corpora

Automatically building a large bilingual corpus that contains millions of words is always a challenging task. In particular, for low-resource languages it is difficult to find an existing parallel corpus large enough to build a real statistical machine translation system. However, comparable non-parallel corpora are abundant on the Internet, such as in Wikipedia...


Chinese-Portuguese Machine Translation: A Study on Building Parallel Corpora from Comparable Texts

Although ties between China and Portuguese-speaking countries are significant and growing, few parallel corpora exist for the Chinese–Portuguese language pair. Both languages are widely spoken, with 1.2 billion native Chinese speakers and 279 million native Portuguese speakers; the language pair, however, could be considered low-resource in terms of available parallel corpora...


Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics

Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora,...



Publication date: 2013